Its a well-known aphorism within data science that the majority of an analysis consists in cleaning and validating our data. If we can collect our data in a clean format from the start, it will allow us to proceed directly to the exploration stage once the data have been collected.
There are number of excellent articles that give an extensive overview of how to collect and organize data. Hadley Wickham’s “Tidy Data”, one of the most citing papers across all of data science, offers an extensive theoretical framework for describing a process for collecting datasets. Karl Broman and Kara Woo’s “Data Organization in Spreadsheets”. offers a balanced between practical advice and an extended discussion of general principles for collecting datasets. The Data Carpentry guide Data Organization in Spreadsheets provides a concise list of common pitfalls.
This chapter provides a summarized set advice for organizing and storing data within a spreadsheet program. Rather than an extensive discussion of various pros and cons, it primarily focuses on the explicit approaches that I recommend. For readers interested in a broader coverage, I suggest reading the articles cited above. Because we are not using any fancy spreadsheet functions here, any program that you would like to use should be fine. The screenshots come from Microsoft Excel, but the same approach will work in Google Sheets, LibreOffice, or another spreadsheet program of your choosing.
In Chapter @ref(intror) the concept of a rectangular dataset, with observations in rows and variables in columns, was introduced. This is the same format that we will use to collect our data. The first thing you will need to do, then, is determine what things you are observing and what properties you would like to collect about each thing. If you are observing different different kinds of things, each of which has a different set of associated properties, you may need to store each set in a different table.
To match the format of rectangular data that we have been working with in R, we need to structure our dataset with a single row of column names, followed by a row of data for each observation. Here is an example from Excel of a nonsense dataset with three variables:
Notice that we need to always start in the first cell, A1, and fill in a consistent number of rows and columns. We do not have multiple tables scattered around the spreadsheet. We do not have multiple header columns. It is just the data itself, stored in an contained rectangle in the upper-left hand corner of our spreadsheet. It is okay to include a small amount of formatting of the cells to help with the data-entry process (I like the make the first row bold), but do not try to record measurable properties of your data with formatting.
It is important to choose good variable names. As we have seen, the variable names in a dataset are used to describe the graphics in the grammar of graphics and for manipulating data with verbs. If our names are too complex or difficult to remember, it will be more difficult to create data visualizations. When variables contain spaces or other special characters, it can become nearly impossible to work within R without first cleaning up the variable names after loading the data.
I find the best approach to variable names is to only use lower-case letters, numbers, and underscores. The underscores can be used in place of spaces, but do not make variable names more complex than needed. Also, make sure to start the name of a variable with a lower-case letter (starting with a number is invalid in R and many other programming languages). Note that I recommend using lowercase even for variable names that should be capitalized, such as acronyms and proper nouns. If you selectively capitalize, you will always need to remember where this was done. In my scheme, it is one less thing to remember.
Throughout the rest of this chapter, continuing with the food theme, we will show examples of a small dataset collecting information about a set of cookbooks. Here is the table before filling in any information, with just the row names.
For the specific variables within a dataset, spaces, capital letters, and other special characters are fine. Just be consistent. Do not use “female” in one observations, “Female” in another, and “F” in a third. Just pick the format and stick to it. Where applicable, try to use standards-based labels (ISO country codes or U.S. Postal service state abbreviations). This will help if you later want to merge your dataset with other sources.
Within each cell, there should only be one piece of information. In particular, this means that cells should not contain units or current symbols. If you have data collected in different units, create a new column and put the units there. Though, it is best to store everything on the same scale. If there is something to note about a particular value, do not put a star or other mark with it. Create a new column. This also means that, as mentioned above, you should not try to store two things in one cell by using formatting to indicate a secondary piece of information. Again, if you have two things to indicate, create a new column. Here is an example of our cookbook dataset with these principles applied:
If you need to include explanatory notes for some of the data, which is often a great idea, do not abandon the rectangular data format. Instead, include an extra column of notes with its own name. For example, here we explain that one of our books is out of print:
In our example table, the number of pages and weight of one book is missing because it is out of print. In order to indicate this, the corresponding cell is blank. Blank values are the only cross-software way to indicate missing values in a consistent way.
Date variables are a well-known source of error in the collecting and recording of data. My recommendation for the most flexible and least error-prone method is to simply record each component of a date as its own column. This means, one column for year, one for month, and one for day. It will be much easier to work with if you keep months as numbers rather than names. If you have time information as well, this can be recorded by putting the hours, minutes, and seconds as their own columns.
One benefit of this method is that it will be easy to record historical data in cases where you may not be sure of the month or day for every row of the dataset. For example, here is a dataset showing properties of the cookbook authors from our dataset:
There is an ISO standard (ISO 8601) for representing dates and times in a consistent way: YYYY-MM-DD. This is often a great format for storing data, and one that is used in several example datasets for this book, but (1) can lead to errors when opening and re-saving in a spreadsheet program and (2) cannot easily store dates with unknown information. I suggest using the separate column approach while collecting your initial dataset. Later, if you re-save a modified dataset from within R, the ISO 8601 format is a good option.
Most sources on collecting data suggest storing your results in a plaintext format. This is a stripped down representation of the data that contains no formatting information and is application agnostic. Excel, GoogleSheets, LibreOffice, and any other spreadsheet program you come across, should be able to save your dataset in a plaintext format. The most commonly used type is called a comma seperated value (or “csv”) file. Here, columns are split by commas and each row is on its own line. Here is what the csv file of our cookbook authors dataset looks like:
variable,long_name,units,description
book_title,Book Title,,"Book name, given in English"
author,Author's name,,Authors given and family name
pages,Number of pages,,Pages in currently available edition
weight,Book weight,grams,Weight of currently available edition
nationality,Author's Nationality,,"Using ISO two letter country codes (i.e., CA)"
birth_year,Author's birth year,,As numeric variable
birth_month,Author's birth month,,As numeric variable
birth_day,Author's birth day,,As numeric variable
Nearly all of the dataset provided with this book are stored as CSV files. One thing to be careful of, particularly when using Excel, is that if your computer is configured for a language that uses a comma as a decimal separator, the default csv output may actually use a semicolon (;) in place of a comma. To read these files in R, just replace the read_csv with the function read_csv2.
Unlike some other sources, I am less strict about the need to only export data as a plaintext file. This is the best way for sharing and storing a dataset once an analysis is finished, but if you are going to continue adding and changing your dataset, it may actually be preferable to store your data in an .xlsx file (it avoids errors that are introduced when converting back and forth between excel and plaintext formats). Data can be loaded directly from an excel file with the readxl package. Here is the syntax for using the package to read in a dataset, with either the first sheet or a named sheet:
library(readxl)
data <- read_excel("authors.xlsx") # load the first sheet
data <- read_excel("authors.xlsx", sheet = "sheetname") # load sheet by nameOnce you are finished with the data collection and cleaning processing, then it is a good idea to store your data in a plaintext format for sharing and long-term preservation.
So far, our discussion has focused on the specifics of storing the data itself. It is also important to document exactly what information is being stored in your variables. To do this, we can construct a data dictionary. This should explain, for each variable, basic information such as the variable’s name, measurement units, and expected categorical values. Any decisions that needed to be made should also be documented. A data dictionary can be a simple text file, or can be stored itself as a structured dataset. Here is an example of a data dictionary for our authors dataset:
We included a long name for each variable, which will be useful when creating improved graphics labels when preparing data visualizations for publication.
There is a lot of information passed along within this chapter. For future reference, here are the key formatting guidelines for storing datasets in spreadsheets:
xlsx file while in the middle of data collection; save as a csv file for long-term storage and sharingAs mentioned in the introduction, this is an opinionated list and some other options are equally valid. The most important thing, however, is consistency, so in this book I will expect that you follow the formatting advice here when collecting data.
There is are two exercise notebooks for this chapter:
For more information about collecting data, I recommend the following references:
It will also be helpful to return to these notes after working through the notion of wide and long data formats in Chapter 10.